[MLAS] Integrate KleidiAI BF16 SME2 Kernel Through Mlas SBGEMM Path#26773

Open
patryk-kaiser-ARM wants to merge 1 commit into microsoft:main from patryk-kaiser-ARM:kai_bf16_kernel_integration
@patryk-kaiser-ARM patryk-kaiser-ARM commented Dec 11, 2025

Description
This PR integrates Arm® KleidiAI™ SME2 BF16 kernel through MLAS SBGEMM path.

Rework of #24346

Motivation and Context
This kernel provides performance improvements on SME-enabled devices.

patryk-kaiser-ARM marked this pull request as draft on December 11, 2025, 11:48
patryk-kaiser-ARM marked this pull request as ready for review on January 6, 2026, 13:59
@patryk-kaiser-ARM (Contributor, Author) commented:

@microsoft-github-policy-service agree company="Arm"

Copilot AI left a comment

Pull request overview

This PR integrates the Arm® KleidiAI™ SME2 BF16 kernel into the MLAS SBGEMM (single-precision to bfloat16 GEMM) path. The integration provides performance improvements for bfloat16 matrix multiplication on Arm devices with SME2 support.
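To make the "single-precision to bfloat16 GEMM" idea concrete, here is a minimal scalar sketch. It is not the MLAS or KleidiAI implementation: the names `RoundToBf16` and `ReferenceSbgemm` are hypothetical, the round-to-nearest-even conversion is an assumption about how fp32 inputs are narrowed, and NaN handling is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: round an fp32 value to bfloat16 precision
// (round-to-nearest-even) and widen it back to fp32.
// NaN inputs are not handled in this sketch.
inline float RoundToBf16(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    bits += 0x7FFFu + ((bits >> 16) & 1u);  // bias so that ties go to even
    bits &= 0xFFFF0000u;                    // keep only the upper 16 bits
    float out;
    std::memcpy(&out, &bits, sizeof(out));
    return out;
}

// Hypothetical reference: C[M x N] = A[M x K] * B[K x N], row-major.
// Inputs are narrowed to bf16 precision; accumulation stays in fp32.
inline std::vector<float> ReferenceSbgemm(std::size_t M, std::size_t N, std::size_t K,
                                          const std::vector<float>& A,
                                          const std::vector<float>& B) {
    assert(A.size() == M * K);
    assert(B.size() == K * N);
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                acc += RoundToBf16(A[i * K + k]) * RoundToBf16(B[k * N + j]);
            }
            C[i * N + j] = acc;
        }
    }
    return C;
}
```

A reference like this is mainly useful as a correctness baseline in tests; an optimized kernel would pack the operands and use hardware BF16 instructions instead of per-element conversion.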

Changes:

  • Added new sbgemm_kleidiai.cpp implementation with KleidiAI BF16 SME2 kernel
  • Introduced BIsPacked flag to MLAS_SBGEMM_DATA_PARAMS to track pre-packed matrix B state
  • Added override mechanism in SBGEMM path for KleidiAI kernels on SME2-enabled platforms
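The override mechanism and the `BIsPacked` flag listed above can be pictured with a small dispatch sketch. Everything here is hypothetical and chosen for illustration: `SbgemmParams` is a stand-in for `MLAS_SBGEMM_DATA_PARAMS` (only `BIsPacked` comes from the PR summary), and `Platform`, `SbgemmOverrideFn`, `DispatchSbgemm`, and `ExampleOverride` are assumed names, not the identifiers used in the PR. The rule that the override only accepts a pre-packed B is likewise an assumption.

```cpp
#include <cstddef>

// Hypothetical stand-in for MLAS_SBGEMM_DATA_PARAMS.
// Only the BIsPacked field is taken from the PR summary.
struct SbgemmParams {
    const float* A = nullptr;
    const void* B = nullptr;   // raw fp32 data or a pre-packed buffer
    float* C = nullptr;
    bool BIsPacked = false;    // true when B was packed ahead of time
};

// Hypothetical override signature: returns true if it handled the GEMM.
using SbgemmOverrideFn = bool (*)(std::size_t M, std::size_t N, std::size_t K,
                                  const SbgemmParams& params);

// Hypothetical platform table; the pointer would be set at start-up
// only when the CPU reports the required feature (for example SME2).
struct Platform {
    SbgemmOverrideFn SbgemmOverride = nullptr;
};

enum class Path { Override, Default };

// Try the registered override first; fall back to the default path
// when no override is registered or it declines the request.
inline Path DispatchSbgemm(const Platform& platform, std::size_t M, std::size_t N,
                           std::size_t K, const SbgemmParams& params) {
    if (platform.SbgemmOverride != nullptr &&
        platform.SbgemmOverride(M, N, K, params)) {
        return Path::Override;
    }
    // The existing (non-overridden) SBGEMM implementation would run here.
    return Path::Default;
}

// Hypothetical override that declines empty problems and unpacked B.
inline bool ExampleOverride(std::size_t M, std::size_t N, std::size_t K,
                            const SbgemmParams& params) {
    if (M == 0 || N == 0 || K == 0) {
        return false;
    }
    return params.BIsPacked;
}
```

Letting the override return a boolean keeps the fallback decision in one place, so unsupported shapes or layouts still reach the default path without the caller needing to know which kernel ran.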

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| onnxruntime/core/mlas/lib/kleidiai/sbgemm_kleidiai.cpp | New implementation of SBGEMM using KleidiAI BF16 SME2 kernel |
| onnxruntime/core/mlas/lib/kleidiai/mlasi_kleidiai.h | Added function declarations for SBGEMM KleidiAI overrides |
| onnxruntime/core/mlas/lib/kai_ukernel_interface.h | Added SBGEMM ukernel interface declaration |
| onnxruntime/core/mlas/lib/kai_ukernel_interface.cpp | Added SBGEMM ukernel instantiation for SME2 |
| onnxruntime/core/mlas/lib/mlasi.h | Added typedef declarations for SBGEMM override functions |
| onnxruntime/core/mlas/lib/sbgemm.h | Added override mechanism to call KleidiAI SBGEMM functions |
| onnxruntime/core/mlas/lib/platform.cpp | Registered KleidiAI SBGEMM overrides for SME2-enabled platforms |
| onnxruntime/core/mlas/inc/mlas.h | Added BIsPacked field to MLAS_SBGEMM_DATA_PARAMS struct |
| onnxruntime/core/providers/cpu/math/matmul.cc | Set BIsPacked flag when using pre-packed matrix B |
| onnxruntime/test/mlas/unittest/test_sbgemm.h | Updated tests to initialize and set BIsPacked flag |
| cmake/onnxruntime_mlas.cmake | Added sbgemm_kleidiai.cpp to build system |

hariharans29 changed the title from "Integrate KleidiAI BF16 SME2 Kernel Through Mlas SBGEMM Path" to "[MLAS] Integrate KleidiAI BF16 SME2 Kernel Through Mlas SBGEMM Path" on Jan 21, 2026
@hariharans29 (Member) commented:

Hi @patryk-kaiser-ARM / @damdoo01-arm - Can you please resolve the conflicts for this PR if it is still on the agenda? We can target merging this PR next. Thanks.

Signed-off-by: Patryk Kaiser <patryk.kaiser@arm.com>
patryk-kaiser-ARM force-pushed the kai_bf16_kernel_integration branch from 51617ca to 509c420 on February 4, 2026, 15:15
@patryk-kaiser-ARM (Contributor, Author) commented:

Hi @hariharans29, I have resolved the conflicts. This one is still on the agenda. I am currently investigating adding fastmath support to more operators so that this change can have a larger impact. However, it would be a good idea to get this one in first and then open subsequent PRs to bring more ops down this path for fastmath.

@patryk-kaiser-ARM (Contributor, Author) commented:

Can the workflows be approved, please?

@hariharans29 (Member) commented:

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

@azure-pipelines commented:

Azure Pipelines successfully started running 4 pipeline(s).
